The post-training data primarily consists of two components: demonstration data $\mathcal{D}=\{(x_i, y_i)\}$ and preference data $\mathcal{P}=\{(x_i, y_i^+, y_i^-)\}$, where $x_i$ represents the instruction, $y_i$ represents a satisfactory response, and $y_i^+$ and $y_i^-$ are two responses to $x_i$, with $y_i^+$ being the preferred choice over $y_i^-$. The set $\mathcal{D}$ is utilized in SFT, whereas $\mathcal{P}$ is employed in RLHF.
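To make the two data formats concrete, the following sketch models $\mathcal{D}$ and $\mathcal{P}$ as simple Python records. The class and field names are illustrative assumptions, not the report's actual data schema.

```python
from dataclasses import dataclass

@dataclass
class Demonstration:
    """One SFT example (x_i, y_i): an instruction and a satisfactory response."""
    instruction: str   # x_i
    response: str      # y_i

@dataclass
class Preference:
    """One RLHF example (x_i, y_i^+, y_i^-): `chosen` is preferred over `rejected`."""
    instruction: str   # x_i
    chosen: str        # y_i^+
    rejected: str      # y_i^-

# Illustrative instances of the two datasets.
demo_data = [Demonstration("Summarize the passage ...", "The passage argues that ...")]
pref_data = [Preference("Summarize the passage ...",
                        "A faithful, concise summary ...",
                        "An off-topic or verbose reply ...")]
```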
The construction of training data entails a two-step process: collaborative data annotation and automated data synthesis. First, we extract the data ontology from large-scale instruction corpora, leading to a broad and diverse set of high-quality instructions. These instructions are systematically enhanced to incorporate greater complexity. Through human annotation, we obtain the target response $y_i$ and its positive and negative counterparts $(y_i^+, y_i^-)$. Subsequently, a variety of automated alignment strategies are employed to synthesize a substantial volume of artificially annotated data across the domains of code, mathematics, instruction-following, creation, role-playing, and safety.
4.1.1 Collaborative Data Annotation

Automatic Ontology Extraction
The process initiates with the application of InsTag (Lu et al., 2024c), an open-set fine-grained tagger, to extract the underlying ontology from a large-scale instruction dataset. Subsequent manual refinement ensures the accuracy of the extracted ontology.
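The report does not expose InsTag's interface, so the sketch below only illustrates the general idea of open-set tagging followed by aggregation for manual refinement. The prompt wording and the `call_llm` helper are assumptions, not an InsTag API.

```python
import json
from collections import Counter

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to a tagging model (assumed)."""
    raise NotImplementedError

def tag_instruction(instruction: str) -> list[str]:
    """Ask the model for free-form, fine-grained tags describing one instruction."""
    prompt = (
        "Assign fine-grained, open-set tags describing the intent and topic of the "
        f"instruction below. Return a JSON list of strings.\n\n{instruction}"
    )
    return json.loads(call_llm(prompt))

def build_ontology(instructions: list[str], min_count: int = 5) -> Counter:
    """Aggregate per-instruction tags into a raw ontology for manual refinement."""
    counts = Counter(tag for ins in instructions for tag in tag_instruction(ins))
    return Counter({tag: n for tag, n in counts.items() if n >= min_count})
```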
Instruction Selection
Each instruction, with tags annotated, is evaluated for tag diversity, semantic richness, complexity, and intent completeness. Based on these criteria, we select a set of representative instructions (Dong et al., 2023).
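A minimal sketch of such a selection filter is shown below. The four criteria mirror those named in the text; the equal weighting, the threshold, and the greedy tag-diversity heuristic are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class TaggedInstruction:
    text: str
    tags: list[str] = field(default_factory=list)
    richness: float = 0.0      # semantic richness score in [0, 1]
    complexity: float = 0.0    # complexity score in [0, 1]
    completeness: float = 0.0  # intent completeness score in [0, 1]

def select_instructions(pool: list[TaggedInstruction],
                        threshold: float = 0.6) -> list[TaggedInstruction]:
    """Greedily keep instructions whose averaged criteria exceed a threshold."""
    selected, seen_tags = [], set()
    for ins in pool:
        # Tag diversity: fraction of this instruction's tags not yet covered.
        diversity = len(set(ins.tags) - seen_tags) / max(len(ins.tags), 1)
        score = 0.25 * (diversity + ins.richness + ins.complexity + ins.completeness)
        if score >= threshold:
            selected.append(ins)
            seen_tags.update(ins.tags)
    return selected
```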
Instruction Evolution
To enrich the instruction dataset, a self-evolution strategy (Zhao et al., 2024) is employed, prompting the Qwen models to add constraints or requirements to existing instructions, thereby increasing their complexity and ensuring a diverse range of difficulty levels within the dataset.
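The core of this step is a rewriting prompt applied iteratively. The sketch below is a hedged approximation: the prompt wording and the `call_llm` placeholder are assumptions, not the report's actual prompts.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to a Qwen model (assumed)."""
    raise NotImplementedError

def evolve_instruction(instruction: str, rounds: int = 2) -> list[str]:
    """Return the original instruction plus progressively harder variants."""
    variants = [instruction]
    for _ in range(rounds):
        prompt = (
            "Rewrite the following instruction so it is harder to satisfy by adding "
            "one concrete constraint or requirement (e.g., a length limit, a required "
            "format, or an extra condition), while keeping the original intent.\n\n"
            f"Instruction: {variants[-1]}"
        )
        variants.append(call_llm(prompt))
    return variants
```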
Human Annotation
Multiple responses to an instruction are obtained using diverse generation strategies and Qwen models of different scales. Annotators rank these responses based on their preferences, ensuring the best response meets established criteria, yielding both demonstration and preference data.
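One natural way to turn a human ranking into both kinds of data is sketched below: the top-ranked response becomes the demonstration, and ranked pairs become preference examples. The specific pairing scheme (every higher-ranked response against every lower-ranked one) is an assumption.

```python
def rankings_to_data(instruction: str, ranked_responses: list[str]):
    """`ranked_responses` is ordered from most to least preferred by annotators."""
    demonstration = (instruction, ranked_responses[0])          # (x_i, y_i)
    preferences = [
        (instruction, better, worse)                            # (x_i, y_i^+, y_i^-)
        for i, better in enumerate(ranked_responses)
        for worse in ranked_responses[i + 1:]
    ]
    return demonstration, preferences
```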
4.1.2 Automated Data Synthesis

Maintaining the quality of annotations for responses to instructions presents significant challenges on a large scale, particularly for responses that require expertise, experience, carefulness, or patience. To address these challenges, we devised various automated alignment strategies to synthesize data at scale.
Rejection Sampling
For mathematical or similar tasks with definitive final answers, rejection sampling (Yuan et al., 2023) is applied to improve the quality of solutions. Large language models (LLMs) are tasked to generate multiple responses, namely the reasoning paths, for each instruction. Paths that result in accurate conclusions and are considered reasonable by the model are preserved, serving as demonstration data. Preference data is generated by contrasting correct and incorrect paths.
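The sketch below illustrates this filter under the assumption that each instruction comes with a verifiable reference answer; `sample_paths` and `extract_answer` are hypothetical helpers, not the report's pipeline.

```python
import itertools

def sample_paths(instruction: str, n: int = 8) -> list[str]:
    """Placeholder: sample n chain-of-thought solutions from the LLM (assumed)."""
    raise NotImplementedError

def extract_answer(path: str) -> str:
    """Placeholder: pull the final answer out of a reasoning path (assumed)."""
    raise NotImplementedError

def rejection_sample(instruction: str, reference_answer: str, n: int = 8):
    """Keep correct paths as demonstrations; pair correct vs. incorrect as preferences."""
    paths = sample_paths(instruction, n)
    correct = [p for p in paths if extract_answer(p) == reference_answer]
    incorrect = [p for p in paths if extract_answer(p) != reference_answer]
    demonstrations = [(instruction, p) for p in correct]
    preferences = [(instruction, good, bad)
                   for good, bad in itertools.product(correct, incorrect)]
    return demonstrations, preferences
```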
Execution Feedback
For coding tasks, LLMs are employed to generate solutions and associated test cases. The efficacy of these solutions is evaluated by compiling and executing them against the test cases, thereby creating demonstration and preference data. This methodology is also applicable to assessing instruction following (Dong et al., 2024). For each instruction with constraints, e.g., a length limit, the LLM is tasked to generate a Python verification function to ensure the response aligns with the instruction requirements.
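A hedged sketch of the instruction-following variant is given below: a model-generated verification function is executed against candidate responses to label them. The example verifier and constraint are invented for illustration, and running model-generated code like this assumes an appropriate sandbox.

```python
def verify_constraint(verifier_code: str, response: str) -> bool:
    """Execute a generated `def verify(response) -> bool` and apply it to a response."""
    namespace: dict = {}
    exec(verifier_code, namespace)          # sandboxing of generated code is assumed
    return bool(namespace["verify"](response))

# Example: a verifier the LLM might emit for "answer in at most 50 words".
generated_verifier = """
def verify(response):
    return len(response.split()) <= 50
"""

candidates = ["A concise answer.", "A very long answer ... " * 40]
labels = [verify_constraint(generated_verifier, c) for c in candidates]
passing = [c for c, ok in zip(candidates, labels) if ok]      # demonstration pool
failing = [c for c, ok in zip(candidates, labels) if not ok]  # rejected pool
```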
Data Repurposing
Creating skilled responses in literary writing tasks is challenging for annotators without specialized training. To tackle this problem, we aggregate high-quality literary works from the public domain and employ LLMs to develop instructions with varying levels of detail. These instructions, paired with the original works, serve as demonstration data. For example, to compile role-play data with vivid and engaging responses, we source detailed character profiles from knowledge repositories such as Wikipedia and instruct LLMs to generate corresponding instructions and responses (Lu et al., 2024b). This process, similar to a reading comprehension task, ensures that the integrity of the character’s profile is maintained.
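For the literary-writing case, the repurposing step amounts to generating an instruction whose ideal answer is the existing text. The sketch below assumes a generic `call_llm` helper and an invented prompt; it is not the report's implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (assumed)."""
    raise NotImplementedError

def repurpose_work(title: str, text: str, detail: str = "detailed") -> tuple[str, str]:
    """Pair an LLM-written instruction with the original work as a demonstration."""
    prompt = (
        f"Write a {detail} writing instruction whose ideal answer is the passage "
        f"below, without quoting it.\n\nTitle: {title}\n\nPassage:\n{text}"
    )
    instruction = call_llm(prompt)
    return instruction, text   # (x_i, y_i): generated instruction, original work
```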
Constitutional Feedback
Constitutional AI refers to the process of guiding LLMs to generate responses based on predefined sets of principles (Bai et al., 2022). To ensure adherence to guidelines such as safety and values, a constitution dataset was compiled. This dataset delineates principles to be followed and those to be avoided. It was used to instruct LLMs to produce responses that either align with or deviate from these guidelines, serving as a reference for demonstration and preference data.
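A minimal sketch of constitution-guided synthesis is shown below: for each instruction and principle, one response is generated while following the principle and one while ignoring it, yielding a preference pair. The prompts and the `call_llm` placeholder are illustrative assumptions.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (assumed)."""
    raise NotImplementedError

def constitutional_pair(instruction: str, principle: str) -> tuple[str, str, str]:
    """Build one (instruction, aligned, deviating) triple from a stated principle."""
    aligned = call_llm(
        f"Principle to follow: {principle}\n\nRespond to: {instruction}"
    )
    deviating = call_llm(
        f"Write a response that ignores the principle '{principle}' "
        f"(for use only as a rejected training example).\n\nRespond to: {instruction}"
    )
    return instruction, aligned, deviating   # (x_i, y_i^+, y_i^-)
```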